
Data Cleaning
D206
March 16, 2023

Contents
Section A: Do customers enrolled in more service offerings display lower levels of churn?
Section C.1: Techniques & Steps
Section C.2: Justification of Methods
Section C.3: Programming Environment
Integrated Development Environment
Section D.1: Data Quality Findings
Quantitative Data: Uniform Distribution
Quantitative Data: Skewed & Bi-Modal Distribution
Re-Expression of Categorical Variables
Section D.4: Verification and Outcome
Section D.7: Effect on the Research Question
Section E.1: Principal Component Analysis
Part Four: Supporting Documents
The dataset for this project consists of AnyTelecom Company's observations on customers and customer churn. As the project's data analyst, my task is to prepare the data for exploratory data analysis and ultimately to answer the following research question: Do customers enrolled in more service offerings display lower levels of churn?
Churn (the measure of customers who stop using a product or service) can have a significant impact on a company's business performance. It is in AnyTelecom Company's best interest to examine the potential relationship between the number of service offerings a customer is enrolled in and the rate of churn.
| # | Variable Name | Data Type | Description | Example |
|---|---------------|-----------|-------------|---------|
| 0 | CaseOrder | Quantitative | preserves order of raw data | 1 |
| 1 | Customer_id | Qualitative | unique customer id | K409198 |
| 2 | Interaction | Qualitative | unique id associated with customer transactions | aa90260b-4141-4a24-8e36-b04ce1f4f77b |
| 3 | City | Qualitative | city of residence | Point Baker |
| 4 | State | Qualitative | state of residence | AK |
| 5 | County | Qualitative | county of residence | Prince of Wales-Hyder |
| 6 | Zip | Quantitative | zip code of residence | 99927 |
| 7 | Lat | Quantitative | gps latitude coordinates | 56.251 |
| 8 | Lng | Quantitative | gps longitude coordinates | -133.37571 |
| 9 | Population | Quantitative | population within mile radius | 38 |
| 10 | Area | Qualitative | area type | Urban |
| 11 | TimeZone | Qualitative | customer time zone | America/Sitka |
| 12 | Job | Qualitative | customer job | Environmental health practitioner |
| 13 | Children | Quantitative | number of children in household | NaN |
| 14 | Age | Quantitative | customer age | 68 |
| 15 | Education | Qualitative | highest degree earned by customer | Master's Degree |
| 16 | Employment | Qualitative | customer employment status | Part Time |
| 17 | Income | Quantitative | customer annual income | 28561.99 |
| 18 | Marital | Qualitative | customer marital status | Widowed |
| 19 | Gender | Qualitative | customer gender identification | Male |
| 20 | Churn | Qualitative | whether service discontinued in last month | No |
| 21 | Outage_sec_perweek | Quantitative | system outages in seconds per week | 6.972566093 |
| 22 | Email | Quantitative | number of emails sent to customer in past year | 10 |
| 23 | Contacts | Quantitative | number of times customer contacted technical support | 0 |
| 24 | Yearly_equip_failure | Quantitative | number of times customer equipment failed in past year | 1 |
| 25 | Techie | Qualitative | customer technical aptitude | No |
| 26 | Contract | Qualitative | contract term | One year |
| 27 | Port_modem | Qualitative | customer has portable modem | Yes |
| 28 | Tablet | Qualitative | customer owns tablet | Yes |
| 29 | InternetService | Qualitative | customer internet service type | Fiber Optic |
| 30 | Phone | Qualitative | customer phone service | Yes |
| 31 | Multiple | Qualitative | customer has multiple lines | No |
| 32 | OnlineSecurity | Qualitative | customer has online security | Yes |
| 33 | OnlineBackup | Qualitative | customer has online backup | Yes |
| 34 | DeviceProtection | Qualitative | customer has device protection | No |
| 35 | TechSupport | Qualitative | customer has technical support | No |
| 36 | StreamingTV | Qualitative | customer has tv streaming | No |
| 37 | StreamingMovies | Qualitative | customer has movie streaming | Yes |
| 38 | PaperlessBilling | Qualitative | customer has paperless billing | Yes |
| 39 | PaymentMethod | Qualitative | customer payment method | Credit Card (automatic) |
| 40 | Tenure | Quantitative | number of months customer stayed with provider | 6.795512947 |
| 41 | MonthlyCharge | Quantitative | amount charged to customer monthly | 171.4497621 |
| 42 | Bandwidth_GB_Year | Quantitative | average amount of data used in past year | 904.5361102 |
| 43 | Item1 | Quantitative | timely response survey result | 5 |
| 44 | Item2 | Quantitative | timely fixes survey result | 5 |
| 45 | Item3 | Quantitative | timely replacements survey result | 5 |
| 46 | Item4 | Quantitative | reliability survey result | 3 |
| 47 | Item5 | Quantitative | options survey result | 4 |
| 48 | Item6 | Quantitative | respectful response survey result | 4 |
| 49 | Item7 | Quantitative | courteous exchange survey result | 3 |
The quality of the data will be
assessed using the following techniques:
The pandas.DataFrame.duplicated() function will be applied
on the dataset to identify 1) duplication across entire records and 2)
duplication across Customer_id values.
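A minimal sketch of both checks, using a small hypothetical DataFrame rather than the project dataset:

```python
import pandas as pd

# hypothetical sample: one full-record duplicate and one repeated Customer_id
sample = pd.DataFrame({
    'Customer_id': ['K409198', 'S120509', 'S120509', 'K409198'],
    'City': ['Point Baker', 'West Branch', 'West Branch', 'Yamhill'],
})

# full-record duplicates: every column must match an earlier row
full_dupes = sample[sample.duplicated()]

# key-level duplicates: only Customer_id must match an earlier row
id_dupes = sample[sample.duplicated(subset='Customer_id')]

print(len(full_dupes))  # 1 (row 2 repeats row 1 entirely)
print(len(id_dupes))    # 2 (rows 2 and 3 repeat earlier ids)
```

The `subset` check is the broader net: a repeated id with different field values would pass the full-record check but fail this one.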
The pandas.DataFrame.isnull().sum() chain will be applied to identify the number of missing values in each column of the dataset. Columns found to contain missing values will be visualized using the missingno.matrix() function.
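A sketch of the per-column missing-value count on hypothetical data (the missingno matrix would then be drawn only for columns with a nonzero count):

```python
import numpy as np
import pandas as pd

# hypothetical slice with scattered missing values
sample = pd.DataFrame({
    'Age': [68.0, np.nan, 50.0],
    'Income': [28561.99, 21704.77, np.nan],
    'Churn': ['No', 'Yes', 'No'],
})

# per-column count of missing values
missing_counts = sample.isnull().sum()
print(missing_counts)  # Age 1, Income 1, Churn 0
```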
Outliers will be identified in
two stages. First, boxplots will be used to visually identify outliers for
relevant numeric variables in the dataset; these boxplots will be produced
using the pandas.DataFrame.boxplot()
function. Next, the number of outliers in each feature will be calculated by
applying the scipy.stats.zscore()
function on each column of the dataset.
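The second stage can be sketched as follows, on a hypothetical column rather than the project data:

```python
import numpy as np
from scipy import stats

# hypothetical feature: fifty typical readings plus one extreme value
values = np.array([10.0] * 50 + [1000.0])

# z-scores measure each value's distance from the mean in standard deviations
z = stats.zscore(values, nan_policy='omit')

# the conventional cutoff of |z| > 3 flags the extreme reading
outlier_count = int((np.abs(z) > 3).sum())
print(outlier_count)  # 1
```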
The pandas.DataFrame.duplicated() function returns a Boolean Series that indicates which records in the dataset are duplicates of earlier records and which are not.
The pandas.DataFrame.isnull().sum() chain returns the count of missing values for each column in the dataset.
Boxplots represent the spread of data and allow one to visually assess the presence of outliers in a sample. The pandas.DataFrame.boxplot() function can be used to produce boxplots for the numerical data in the dataset.
In statistics, a z-score describes the numeric relationship between a value and the mean, expressed in standard deviations. When combined with the standard cutoff value (+/-3), z-scores can also determine whether a value is an outlier.
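The z-score itself is simply (value − mean) / standard deviation. A minimal hand computation, matching scipy's default population standard deviation (ddof=0):

```python
# small illustrative sample (not project data)
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = sum(values) / len(values)                                    # 5.0
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5   # 2.0

# the value 9.0 sits two standard deviations above the mean
z = (9.0 - mean) / std                                              # 2.0
```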
Python will be used to clean the
dataset. Python is an open-source, robust programming language that includes
several libraries and packages used to perform data analytics, data science,
data engineering, and machine learning.
The following libraries/packages
will be used in conjunction with Python:
JupyterLab will be used as the Integrated
Development Environment. JupyterLab is a web-based,
open-source data science tool that allows users to write and share code in the
form of a notebook.
In [2]:
# Title: pandas-dev/pandas
# Author: The pandas development team
# Date: 2023
# Code Version: latest
# Availability: https://doi.org/10.5281/zenodo.7741580
# import pandas library
import pandas as pd
In [3]:
# read churn data into DataFrame
df = pd.read_csv('source files/churn_raw_data.csv', header='infer')
In [4]:
# get shape (rows & columns) of the DataFrame
df.shape
Out[4]:
(10000, 52)
In [5]:
# get header and first 5 rows of the DataFrame
df.head(5)
Out[5]:
| | Unnamed: 0 | CaseOrder | Customer_id | Interaction | City | State | County | Zip | Lat | Lng | ... |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | K409198 | aa90260b-4141-4a24-8e36-b04ce1f4f77b | Point Baker | AK | Prince of Wales-Hyder | 99927 | 56.25100 | -133.37571 | ... |
| 1 | 2 | 2 | S120509 | fb76459f-c047-4a9d-8af9-e0f7d4ac2524 | West Branch | MI | Ogemaw | 48661 | 44.32893 | -84.24080 | ... |
| 2 | 3 | 3 | K191035 | 344d114c-3736-4be5-98f7-c72c281e2d35 | Yamhill | OR | Yamhill | 97148 | 45.35589 | -123.24657 | ... |
| 3 | 4 | 4 | D90850 | abfa2b40-2d43-4994-b15a-989b8c79e311 | Del Mar | CA | San Diego | 92014 | 32.96687 | -117.24798 | ... |
| 4 | 5 | 5 | K662701 | 68a861fd-0d20-4e51-a587-8a90407ee574 | Needville | TX | Fort Bend | 77461 | 29.38012 | -95.80673 | ... |
5 rows × 52 columns
In [6]:
# view information about the DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 52 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Unnamed: 0            10000 non-null  int64
 1   CaseOrder             10000 non-null  int64
 2   Customer_id           10000 non-null  object
 3   Interaction           10000 non-null  object
 4   City                  10000 non-null  object
 5   State                 10000 non-null  object
 6   County                10000 non-null  object
 7   Zip                   10000 non-null  int64
 8   Lat                   10000 non-null  float64
 9   Lng                   10000 non-null  float64
 10  Population            10000 non-null  int64
 11  Area                  10000 non-null  object
 12  Timezone              10000 non-null  object
 13  Job                   10000 non-null  object
 14  Children              7505 non-null   float64
 15  Age                   7525 non-null   float64
 16  Education             10000 non-null  object
 17  Employment            10000 non-null  object
 18  Income                7510 non-null   float64
 19  Marital               10000 non-null  object
 20  Gender                10000 non-null  object
 21  Churn                 10000 non-null  object
 22  Outage_sec_perweek    10000 non-null  float64
 23  Email                 10000 non-null  int64
 24  Contacts              10000 non-null  int64
 25  Yearly_equip_failure  10000 non-null  int64
 26  Techie                7523 non-null   object
 27  Contract              10000 non-null  object
 28  Port_modem            10000 non-null  object
 29  Tablet                10000 non-null  object
 30  InternetService       10000 non-null  object
 31  Phone                 8974 non-null   object
 32  Multiple              10000 non-null  object
 33  OnlineSecurity        10000 non-null  object
 34  OnlineBackup          10000 non-null  object
 35  DeviceProtection      10000 non-null  object
 36  TechSupport           9009 non-null   object
 37  StreamingTV           10000 non-null  object
 38  StreamingMovies       10000 non-null  object
 39  PaperlessBilling      10000 non-null  object
 40  PaymentMethod         10000 non-null  object
 41  Tenure                9069 non-null   float64
 42  MonthlyCharge         10000 non-null  float64
 43  Bandwidth_GB_Year     8979 non-null   float64
 44  item1                 10000 non-null  int64
 45  item2                 10000 non-null  int64
 46  item3                 10000 non-null  int64
 47  item4                 10000 non-null  int64
 48  item5                 10000 non-null  int64
 49  item6                 10000 non-null  int64
 50  item7                 10000 non-null  int64
 51  item8                 10000 non-null  int64
dtypes: float64(9), int64(15), object(28)
memory usage: 4.0+ MB
In [7]:
# return duplicated rows
df[df.duplicated() == True]
Out[7]:
|
Unnamed: 0 |
CaseOrder |
Customer_id |
Interaction |
City |
State |
County |
Zip |
Lat |
Lng |
... |
0 rows × 52 columns
In [8]:
# check for records with duplicates in Customer_id column
df[df.duplicated(subset='Customer_id') == True]
Out[8]:
|
Unnamed: 0 |
CaseOrder |
Customer_id |
Interaction |
City |
State |
County |
Zip |
Lat |
Lng |
... |
0 rows × 52 columns
In [9]:
# find the sum of missing values for each column
df.isnull().sum()
Out[9]:
Unnamed: 0                 0
CaseOrder                  0
Customer_id                0
Interaction                0
City                       0
State                      0
County                     0
Zip                        0
Lat                        0
Lng                        0
Population                 0
Area                       0
Timezone                   0
Job                        0
Children                2495
Age                     2475
Education                  0
Employment                 0
Income                  2490
Marital                    0
Gender                     0
Churn                      0
Outage_sec_perweek         0
Email                      0
Contacts                   0
Yearly_equip_failure       0
Techie                  2477
Contract                   0
Port_modem                 0
Tablet                     0
InternetService            0
Phone                   1026
Multiple                   0
OnlineSecurity             0
OnlineBackup               0
DeviceProtection           0
TechSupport              991
StreamingTV                0
StreamingMovies            0
PaperlessBilling           0
PaymentMethod              0
Tenure                   931
MonthlyCharge              0
Bandwidth_GB_Year       1021
item1                      0
item2                      0
item3                      0
item4                      0
item5                      0
item6                      0
item7                      0
item8                      0
dtype: int64
In [10]:
# install missingno using pip
# !pip install missingno
In [11]:
# Title: ResidentMario/missingno
# Author: Bilogur, et al.
# Date: 2018
# Code Version: latest
# Availability: https://doi.org/10.5281/zenodo.1184723
# import missingno to visualize missing values
import missingno as msno
In [12]:
# isolate columns with missing values and add them to a new DataFrame
df_missing = df.loc[:, df.isnull().sum() > 0]
df_missing.isnull().sum()
Out[12]:
Children             2495
Age                  2475
Income               2490
Techie               2477
Phone                1026
TechSupport           991
Tenure                931
Bandwidth_GB_Year    1021
dtype: int64
In [13]:
# generate matrix of missing values
msno.matrix(df_missing)
Out[13]:
<Axes: >
In [14]:
# use histograms to check columns in df_missing for skewness
df_missing.hist(figsize=(15,2), layout=(1,5))
Out[14]:
array([[<Axes: title={'center': 'Children'}>,
        <Axes: title={'center': 'Age'}>,
        <Axes: title={'center': 'Income'}>,
        <Axes: title={'center': 'Tenure'}>,
        <Axes: title={'center': 'Bandwidth_GB_Year'}>]], dtype=object)
In [15]:
# isolate columns with numerical data relevant for outlier analysis
df_numericals = df[['Children', 'Age', 'Income', 'Outage_sec_perweek',
                    'Email', 'Contacts', 'Yearly_equip_failure',
                    'Tenure', 'MonthlyCharge', 'Bandwidth_GB_Year']]
df_numericals.head(5)
Out[15]:
| | Children | Age | Income | Outage_sec_perweek | Email | Contacts | Yearly_equip_failure | Tenure | MonthlyCharge | … |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 68.0 | 28561.99 | 6.972566 | 10 | 0 | 1 | 6.795513 | 171.449762 | … |
| 1 | 1.0 | 27.0 | 21704.77 | 12.014541 | 12 | 0 | 1 | 1.156681 | 242.948015 | … |
| 2 | 4.0 | 50.0 | NaN | 10.245616 | 9 | 0 | 1 | 15.754144 | 159.440398 | … |
| 3 | 1.0 | 48.0 | 18925.23 | 15.206193 | 15 | 2 | 0 | 17.087227 | 120.249493 | … |
| 4 | 0.0 | 83.0 | 40074.19 | 8.960316 | 16 | 2 | 1 | 1.670972 | 150.761216 | … |
In [16]:
df_numericals.boxplot(column='Children')
Out[16]:
<Axes: >
In [17]:
df_numericals.boxplot(column='Age')
Out[17]:
<Axes: >
In [18]:
df_numericals.boxplot(column='Income')
Out[18]:
<Axes: >
In [19]:
df_numericals.boxplot(column='Outage_sec_perweek')
Out[19]:
<Axes: >
In [20]:
df_numericals.boxplot(column='Email')
Out[20]:
<Axes: >
In [21]:
df_numericals.boxplot(column='Contacts')
Out[21]:
<Axes: >
In [22]:
df_numericals.boxplot(column='Yearly_equip_failure')
Out[22]:
<Axes: >
In [23]:
df_numericals.boxplot(column='Tenure')
Out[23]:
<Axes: >
In [24]:
df_numericals.boxplot(column='MonthlyCharge')
Out[24]:
<Axes: >
In [25]:
df_numericals.boxplot(column='Bandwidth_GB_Year')
Out[25]:
<Axes: >
In [26]:
# Title: scipy/scipy: Scipy
# Author: Gommers, et al.
# Date: 2023
# Code Version: latest
# Availability: https://doi.org/10.5281/zenodo.7655153
# import scipy.stats to calculate z-scores
from scipy import stats
In [27]:
# create a new DataFrame to hold z-scores for columns that contain outliers
data = {
    'Children': [],
    'Income': [],
    'Outage_sec_perweek': [],
    'Email': [],
    'Contacts': [],
    'Yearly_equip_failure': [],
    'MonthlyCharge': []
}
df_zscores = pd.DataFrame(data)
In [28]:
# add z-scores to df_zscores
df_zscores['Children'] = stats.zscore(df_numericals['Children'], nan_policy='omit')
df_zscores['Income'] = stats.zscore(df_numericals['Income'], nan_policy='omit')
df_zscores['Outage_sec_perweek'] = stats.zscore(df_numericals['Outage_sec_perweek'], nan_policy='omit')
df_zscores['Email'] = stats.zscore(df_numericals['Email'], nan_policy='omit')
df_zscores['Contacts'] = stats.zscore(df_numericals['Contacts'], nan_policy='omit')
df_zscores['Yearly_equip_failure'] = stats.zscore(df_numericals['Yearly_equip_failure'], nan_policy='omit')
df_zscores['MonthlyCharge'] = stats.zscore(df_numericals['MonthlyCharge'], nan_policy='omit')
df_zscores.head(5)
Out[28]:
| | Children | Income | Outage_sec_perweek | Email | Contacts | Yearly_equip_failure | MonthlyCharge |
|---|---|---|---|---|---|---|---|
| 0 | NaN | -0.401133 | -0.637726 | -0.666282 | -1.005852 | 0.946658 | -0.060613 |
| 1 | -0.508646 | -0.642955 | 0.079935 | -0.005288 | -1.005852 | 0.946658 | 1.589348 |
| 2 | 0.883715 | NaN | -0.171849 | -0.996779 | -1.005852 | 0.946658 | -0.337752 |
| 3 | -0.508646 | -0.740976 | 0.534225 | 0.986203 | 1.017588 | -0.625864 | -1.242158 |
| 4 | -0.972766 | 0.004846 | -0.354795 | 1.316700 | 1.017588 | 0.946658 | -0.538041 |
In [29]:
# return records where the absolute value of the z-score is greater than 3, Children column
children_outliers = df_zscores[df_zscores['Children'].abs() > 3]['Children']
children_outliers.count()
Out[29]:
144
In [30]:
# return records where the absolute value of the z-score is greater than 3, Income column
income_outliers = df_zscores[df_zscores['Income'].abs() > 3]['Income']
income_outliers.count()
Out[30]:
110
In [31]:
# return records where the absolute value of the z-score is greater than 3, Outage_sec_perweek column
outage_outliers = df_zscores[df_zscores['Outage_sec_perweek'].abs() > 3]['Outage_sec_perweek']
outage_outliers.count()
Out[31]:
491
In [32]:
# return records where the absolute value of the z-score is greater than 3, Email column
email_outliers = df_zscores[df_zscores['Email'].abs() > 3]['Email']
email_outliers.count()
Out[32]:
12
In [33]:
# return records where the absolute value of the z-score is greater than 3, Contacts column
contacts_outliers = df_zscores[df_zscores['Contacts'].abs() > 3]['Contacts']
contacts_outliers.count()
Out[33]:
165
In [34]:
# return records where the absolute value of the z-score is greater than 3, Yearly_equip_failure column
failure_outliers = df_zscores[df_zscores['Yearly_equip_failure'].abs() > 3]['Yearly_equip_failure']
failure_outliers.count()
Out[34]:
94
In [35]:
# return records where the absolute value of the z-score is greater than 3, MonthlyCharge column
charge_outliers = df_zscores[df_zscores['MonthlyCharge'].abs() > 3]['MonthlyCharge']
charge_outliers.count()
Out[35]:
3
No duplicates were found.
Missing values were found for the following features: Children (2495), Age (2475), Income (2490), Techie (2477), Phone (1026), TechSupport (991), Tenure (931), and Bandwidth_GB_Year (1021).
The following features contained outliers: Children (144), Income (110), Outage_sec_perweek (491), Email (12), Contacts (165), Yearly_equip_failure (94), and MonthlyCharge (3).
Since no duplicates were found,
it was not necessary to implement methods to handle duplicate values.
Depending on the data type and
characteristics of the data, different methods were used to handle missing
values.
Features of the qualitative data type were Techie, Phone, and TechSupport. As it is not possible to compute the mean or median on categorical data, missing values for these features were imputed using the mode.
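Mode imputation for a categorical column can be sketched on a small hypothetical series:

```python
import numpy as np
import pandas as pd

# hypothetical Yes/No column with one missing entry
s = pd.Series(['Yes', 'No', 'No', np.nan, 'No'])

# mode() returns a Series (ties are possible); take the first mode
s_filled = s.fillna(s.mode()[0])
print(s_filled.tolist())  # ['Yes', 'No', 'No', 'No', 'No']
```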
Age was the only quantitative feature with a uniform distribution. To preserve the shape of this feature, missing values were imputed using random values. This data cleaning step was performed in several phases. First, the numpy.random.randint() function was used to create an array of random integers within the range of Age. Next, a new DataFrame was created containing the records where the Age variable was null. Finally, the array of random integers was inserted into the new DataFrame, and its values were used to replace NA/NaN values in Age by applying the pandas.DataFrame.fillna() function on the column.
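Condensed onto a hypothetical five-value series, the phases above look roughly like this (the seed is only for reproducibility of the sketch):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # reproducibility of this sketch only
age = pd.Series([68.0, np.nan, 50.0, np.nan, 27.0])

# one random integer in [18, 90) per missing entry (upper bound exclusive)
n_missing = int(age.isnull().sum())
rand_ages = np.random.randint(18, 90, n_missing)

# align the random draws with the positions of the nulls, then fill
fills = pd.Series(rand_ages, index=age[age.isnull()].index, dtype=float)
age_filled = age.fillna(fills)
```

Because fillna aligns on the index, only the null positions receive random values; observed ages are untouched.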
Features of the quantitative data type with skewed or bi-modal distributions were Children, Income, Tenure, and Bandwidth_GB_Year. To avoid further skewing the distribution of these features, the median, rather than the mean, was used to impute their missing values.
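A sketch of median imputation on a hypothetical right-skewed column:

```python
import numpy as np
import pandas as pd

# hypothetical Income values with two missing entries
income = pd.Series([28561.99, np.nan, 21704.77, 18925.23, np.nan])

# the median resists distortion from a long right tail, unlike the mean
income_filled = income.fillna(income.median())
```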
If there was indication that a set of outliers was the result of error or illegitimate entry, records containing those outliers were removed from the dataset. For this project, illegitimacy was determined based on the volume of outliers found in a given feature: it was assumed that if a feature contained only a sparse number of outliers, then those outliers were illegitimate. In Section C.4: Outliers, queries were performed to identify the number of outliers in each feature.
The features MonthlyCharge and Email contained relatively few outliers (3 and 12, respectively). Records containing these outliers were removed from the dataset by applying the pandas.DataFrame.drop() function to the index of each outlying record.
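The flag-then-drop pattern can be sketched end to end on hypothetical data:

```python
import pandas as pd
from scipy import stats

# hypothetical charges: one implausible value among ordinary ones
df = pd.DataFrame({'MonthlyCharge': [171.4, 242.9, 159.4, 5000.0] + [150.0] * 46})

# flag rows whose |z-score| exceeds 3, then drop them by index
z = pd.Series(stats.zscore(df['MonthlyCharge']), index=df.index)
outlier_idx = df[z.abs() > 3].index
df_clean = df.drop(index=outlier_idx)
```

Dropping by index rather than by value avoids accidentally removing legitimate rows that happen to share a value with an outlier.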
Each of the remaining features contained a comparatively large number of outliers: Children (144), Income (110), Outage_sec_perweek (491), Contacts (165), and Yearly_equip_failure (94). These features were retained in the dataset.
In [36]:
# fill missing values for the Techie, Phone, and TechSupport columns using the mode
df['Techie'].fillna(df['Techie'].mode()[0], inplace=True)
df['Phone'].fillna(df['Phone'].mode()[0], inplace=True)
df['TechSupport'].fillna(df['TechSupport'].mode()[0], inplace=True)
In [37]:
# verify Techie, Phone, and TechSupport do not contain missing values
df[['Techie', 'Phone', 'TechSupport']].isnull().sum()
Out[37]:
Techie 0Phone 0TechSupport 0
dtype: int64
In [38]:
# get range of the Age feature
df['Age'].describe()
Out[38]:
count    7525.000000
mean       53.275748
std        20.753928
min        18.000000
25%        35.000000
50%        53.000000
75%        71.000000
max        89.000000
Name: Age, dtype: float64
In [39]:
# Title: Numpy – Annual Update
# Author: Inessa Pawson
# Date: 2021
# Code Version: latest
# Availability: https://doi.org/10.25080/majora-1b6fd038-026
# import numpy to generate random integers
import numpy as np
In [40]:
# fill missing values for Age using random values within the observed range (18-89)
age_min = 18
age_max = 90  # np.random.randint's upper bound is exclusive
size = df['Age'].isnull().sum()
# create random array using numpy
rand_array = np.random.randint(age_min, age_max, size)
# create DataFrame to fillna with
values = df[df['Age'].isnull()][['Age']]
values.insert(0, 'rand_int', rand_array)
df['Age'].fillna(value=values['rand_int'], inplace=True)
In [41]:
# verify Age does not contain missing values
df['Age'].isnull().sum()
Out[41]:
0
In [42]:
df['Age'].hist()
Out[42]:
<Axes: >
In [43]:
# fill missing values for Children, Income, Tenure, Bandwidth_GB_Year using the median
df['Children'].fillna(df['Children'].median(), inplace=True)
df['Income'].fillna(df['Income'].median(), inplace=True)
df['Tenure'].fillna(df['Tenure'].median(), inplace=True)
df['Bandwidth_GB_Year'].fillna(df['Bandwidth_GB_Year'].median(), inplace=True)
In [44]:
# verify Children, Income, Tenure, and Bandwidth_GB_Year do not contain missing values
df[['Children', 'Income', 'Tenure', 'Bandwidth_GB_Year']].isnull().sum()
Out[44]:
Children 0Income 0Tenure 0Bandwidth_GB_Year 0
dtype: int64
In [45]:
df[['Children', 'Income', 'Tenure', 'Bandwidth_GB_Year']].hist()
Out[45]:
array([[<Axes: title={'center': 'Children'}>,
        <Axes: title={'center': 'Income'}>],
       [<Axes: title={'center': 'Tenure'}>,
        <Axes: title={'center': 'Bandwidth_GB_Year'}>]], dtype=object)
In [46]:
# verify that all missing values have been filled
msno.matrix(df[['Children', 'Age', 'Income', 'Techie', 'Phone',
                'TechSupport', 'Tenure', 'Bandwidth_GB_Year']])
Out[46]:
<Axes: >
In [47]:
# remove outliers from the MonthlyCharge and Email columns
df.drop(index=charge_outliers.index, inplace=True)
df.drop(index=email_outliers.index, inplace=True)
In [48]:
# verify outliers have been removed from MonthlyCharge
df[['MonthlyCharge']].boxplot()
Out[48]:
<Axes: >
In [49]:
# verify outliers have been removed from Email
df[['Email']].boxplot()
Out[49]:
<Axes: >
In [50]:
# view categorical variables
df[['Interaction', 'City', 'State', 'County', 'Area', 'Timezone', 'Job',
    'Education', 'Employment', 'Marital', 'Gender', 'Churn', 'Techie',
    'Contract', 'Port_modem', 'Tablet', 'InternetService', 'Phone',
    'Multiple', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
    'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling',
    'PaymentMethod']].head(5)
Out[50]:
| | Interaction | City | State | County | Area | Timezone | Job | Education | Employment | ... |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aa90260b-4141-4a24-8e36-b04ce1f4f77b | Point Baker | AK | Prince of Wales-Hyder | Urban | America/Sitka | Environmental health practitioner | Master's Degree | Part Time | ... |
| 1 | fb76459f-c047-4a9d-8af9-e0f7d4ac2524 | West Branch | MI | Ogemaw | Urban | America/Detroit | Programmer, multimedia | Regular High School Diploma | Retired | ... |
| 2 | 344d114c-3736-4be5-98f7-c72c281e2d35 | Yamhill | OR | Yamhill | Urban | America/Los_Angeles | Chief Financial Officer | Regular High School Diploma | Student | ... |
| 3 | abfa2b40-2d43-4994-b15a-989b8c79e311 | Del Mar | CA | San Diego | Suburban | America/Los_Angeles | Solicitor | Doctorate Degree | Retired | ... |
| 4 | 68a861fd-0d20-4e51-a587-8a90407ee574 | Needville | TX | Fort Bend | Suburban | America/Chicago | Medical illustrator | Master's Degree | Student | ... |
5 rows × 27 columns
In [51]:
# count unique values in each categorical variable
# will be used to determine which variables can be re-expressed as numerical data
# variables with 3 or fewer unique values will be re-expressed as numerical data
df[['Interaction', 'City', 'State', 'County', 'Area', 'Timezone', 'Job',
    'Education', 'Employment', 'Marital', 'Gender', 'Churn', 'Techie',
    'Contract', 'Port_modem', 'Tablet', 'InternetService', 'Phone',
    'Multiple', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
    'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling',
    'PaymentMethod']].nunique()
Out[51]:
Interaction         9985
City                6054
State                 52
County              1620
Area                   3
Timezone              25
Job                  639
Education             12
Employment             5
Marital                5
Gender                 3
Churn                  2
Techie                 2
Contract               3
Port_modem             2
Tablet                 2
InternetService        3
Phone                  2
Multiple               2
OnlineSecurity         2
OnlineBackup           2
DeviceProtection       2
TechSupport            2
StreamingTV            2
StreamingMovies        2
PaperlessBilling       2
PaymentMethod          4
dtype: int64
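The "3 or fewer unique values" rule in the cell above can be applied programmatically rather than by eye; a sketch on a hypothetical slice:

```python
import pandas as pd

# hypothetical categorical columns
sample = pd.DataFrame({
    'Churn': ['No', 'Yes', 'No', 'No'],
    'City': ['Point Baker', 'West Branch', 'Yamhill', 'Del Mar'],
    'Contract': ['One year', 'Month-to-month', 'Two Year', 'Two Year'],
})

# keep only columns with at most 3 distinct values for numeric re-expression
counts = sample.nunique()
low_cardinality = counts[counts <= 3].index.tolist()
print(low_cardinality)  # ['Churn', 'Contract']
```

High-cardinality columns such as City are excluded, since encoding thousands of categories as integers would imply a meaningless ordering.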
In [52]:
# apply ordinal encoding: yes/no values
df.replace(to_replace='No', value=0, inplace=True)
df.replace(to_replace='Yes', value=1, inplace=True)
df[['Area', 'Gender', 'Churn', 'Techie', 'Contract', 'Port_modem',
    'Tablet', 'InternetService', 'Phone', 'Multiple', 'OnlineSecurity',
    'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
    'StreamingMovies', 'PaperlessBilling']].head(5)
Out[52]:
| | Area | Gender | Churn | Techie | Contract | Port_modem | Tablet | InternetService | Phone | Multiple | … |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Urban | Male | 0 | 0 | One year | 1 | 1 | Fiber Optic | 1 | 0 | … |
| 1 | Urban | Female | 1 | 1 | Month-to-month | 0 | 1 | Fiber Optic | 1 | 1 | … |
| 2 | Urban | Female | 0 | 1 | Two Year | 1 | 0 | DSL | 1 | 1 | … |
| 3 | Suburban | Male | 0 | 1 | Two Year | 0 | 0 | DSL | 1 | 0 | … |
| 4 | Suburban | Male | 1 | 0 | Month-to-month | 1 | 0 | Fiber Optic | 0 | 0 | … |
In [53]:
# apply ordinal encoding: Contract variable
contract_values = {'Month-to-month': 0, 'One year': 1, 'Two Year': 2}
df.replace(contract_values, inplace=True)
df[['Contract']].head(5)
Out[53]:
| | Contract |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 2 |
| 3 | 2 |
| 4 | 0 |
In [54]:
# label encoding: Area and Gender variables
# return unique values from Area and Gender
df['Area'].unique(), df['Gender'].unique()
Out[54]:
(array(['Urban', 'Suburban', 'Rural'], dtype=object),
 array(['Male', 'Female', 'Prefer not to answer'], dtype=object))
In [55]:
# apply label encoding for Area and Gender
area_values = {'Rural': 0, 'Suburban': 1, 'Urban': 2}
gender_values = {'Female': 0, 'Male': 1, 'Prefer not to answer': 2}
df.replace(area_values, inplace=True)
df.replace(gender_values, inplace=True)
df[['Area', 'Gender']].head(5)
Out[55]:
| | Area | Gender |
|---|---|---|
| 0 | 2 | 1 |
| 1 | 2 | 0 |
| 2 | 2 | 0 |
| 3 | 1 | 1 |
| 4 | 1 | 1 |
In [56]:
# return unique values from InternetService variable
# will be used to perform one-hot encoding
df['InternetService'].unique()
Out[56]:
array(['Fiber Optic', 'DSL', 'None'], dtype=object)
In [57]:
# create new columns (FiberOptic & DSL) to perform one-hot encoding
df['DSL'] = df['InternetService'].str.contains('DSL')
df['FiberOptic'] = df['InternetService'].str.contains('Fiber Optic')
df.drop(columns='InternetService', inplace=True)
df[['DSL', 'FiberOptic']].head()
Out[57]:
| | DSL | FiberOptic |
|---|---|---|
| 0 | False | True |
| 1 | False | True |
| 2 | True | False |
| 3 | True | False |
| 4 | False | True |
In [58]:
# apply ordinal encoding: True/False values
df.replace(to_replace=False, value=0, inplace=True)
df.replace(to_replace=True, value=1, inplace=True)
df[['DSL', 'FiberOptic']].head()
Out[58]:
| | DSL | FiberOptic |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 1 |
| 2 | 1 | 0 |
| 3 | 1 | 0 |
| 4 | 0 | 1 |
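An equivalent, more concise route for this style of one-hot encoding is pandas' get_dummies; a sketch on a hypothetical InternetService column (column names here follow the raw category labels, not the FiberOptic/DSL names used above):

```python
import pandas as pd

service = pd.DataFrame({'InternetService': ['Fiber Optic', 'DSL', 'None']})

# one indicator column per category, named after the category labels
dummies = pd.get_dummies(service['InternetService'])

# replace the original column with the indicators
encoded = pd.concat([service.drop(columns='InternetService'), dummies], axis=1)
print(list(dummies.columns))  # ['DSL', 'Fiber Optic', 'None']
```

The manual str.contains approach above and get_dummies produce the same information; get_dummies simply scales better when a variable has many categories.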
In [59]:
# data cleaning and data wrangling complete
# save results to csv
# df.to_csv('output files/churn_cleaned_data.csv', header=True)
In [60]:
# install sklearn
# pip install scikit-learn
In [61]:
# Title: scikit-learn/scikit-learn: Scikit-learn
# Author: Grisel, et al.
# Date: 2023
# Code Version: latest
# Availability: https://doi.org/10.5281/zenodo.7711792
# import libraries from sklearn to perform PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
In [62]:
# create DataFrame with numerical values for PCA
variables = df[['Zip', 'Lat', 'Lng', 'Population', 'Children', 'Age',
                'Income', 'Outage_sec_perweek', 'Email', 'Contacts',
                'Yearly_equip_failure', 'Tenure', 'MonthlyCharge',
                'Bandwidth_GB_Year', 'item1', 'item2', 'item3', 'item4',
                'item5', 'item6', 'item7', 'item8']]
variables.head(5)
Out[62]:
| | Zip | Lat | Lng | Population | Children | Age | Income | Outage_sec_perweek | Email | Contacts | ... |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 99927 | 56.25100 | -133.37571 | 38 | 1.0 | 68.0 | 28561.990 | 6.972566 | 10 | 0 | ... |
| 1 | 48661 | 44.32893 | -84.24080 | 10446 | 1.0 | 27.0 | 21704.770 | 12.014541 | 12 | 0 | ... |
| 2 | 97148 | 45.35589 | -123.24657 | 3735 | 4.0 | 50.0 | 33186.785 | 10.245616 | 9 | 0 | ... |
| 3 | 92014 | 32.96687 | -117.24798 | 13863 | 1.0 | 48.0 | 18925.230 | 15.206193 | 15 | 2 | ... |
| 4 | 77461 | 29.38012 | -95.80673 | 11352 | 0.0 | 83.0 | 40074.190 | 8.960316 | 16 | 2 | ... |
5 rows × 22 columns
In [63]:
# standardize the PCA input variables using StandardScaler
scaler = StandardScaler()
scaled_variables = scaler.fit_transform(variables)
In [64]:
# apply PCA
pca = PCA(n_components=variables.shape[1])
pca.fit(scaled_variables)
df_pca = pd.DataFrame(pca.transform(scaled_variables))
In [65]:
loadings = pd.DataFrame(
    pca.components_.T,
    columns=['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7', 'PC8',
             'PC9', 'PC10', 'PC11', 'PC12', 'PC13', 'PC14', 'PC15',
             'PC16', 'PC17', 'PC18', 'PC19', 'PC20', 'PC21', 'PC22'],
    index=variables.columns)
loadings
Out[65]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 ... PC13 PC14 PC15 PC16 PC17 PC18 PC19 PC20 PC21 PC22
Zip -0.019116 0.600737 0.362174 0.045941 0.025099 0.010736 0.005911 0.002608 0.005987 0.010813 ... -0.014603 -0.091734 -0.075165 -0.013420 0.012051 -0.002998 -0.003844 -0.011517 0.038715 -0.698097
Lat -0.001275 0.035135 -0.003037 -0.004258 -0.703311 -0.062578 -0.058239 -0.079983 0.005248 0.017169 ... 0.029083 0.518079 0.450483 0.039628 -0.028215 -0.006929 0.014427 0.007467 0.007990 -0.117250
Lng 0.017064 -0.603891 -0.361951 -0.043544 0.061476 -0.000519 0.002413 0.013376 -0.009448 -0.007822 ... 0.003941 -0.020241 -0.018524 0.002366 -0.006369 0.012015 -0.006799 0.008736 0.040580 -0.703302
Population -0.002907 0.052810 0.038574 0.023750 0.675478 0.075009 0.054625 0.063100 0.053312 -0.044313 ... 0.040914 0.585722 0.397697 0.051240 -0.000762 -0.000975 0.000142 0.006246 0.001253 -0.030959
[Output: PCA component loadings matrix, 22 rows × 22 columns (rows include Children, Age, Income, Outage_sec_perweek, Email, Contacts, Yearly_equip_failure, Tenure, MonthlyCharge, Bandwidth_GB_Year, and item1–item8; middle columns truncated by the pandas display)]
In [67]:
# calculate the covariance matrix of the scaled data, then the eigenvalue for each principal component
cov_matrix = np.dot(scaled_variables.T, scaled_variables) / scaled_variables.shape[0]
eigenvalues = [np.dot(eigenvector.T, np.dot(cov_matrix, eigenvector)) for eigenvector in pca.components_]
In [68]:
# Title: matplotlib/matplotlib
# Author: Caswell, et al.
# Date: 2023
# Code Version: latest
# Availability: https://doi.org/10.5281/zenodo.7697899
# import matplotlib for plots
import matplotlib.pyplot as plt
In [69]:
# plot eigenvalues against number of components
plt.plot(eigenvalues)
plt.xlabel('number of components')
plt.ylabel('eigenvalue')
plt.axhline(y=1, color='r')
plt.show()

No steps were taken to handle
duplicate values.
Depending on the datatype and
characteristics of the data, missing values were filled using the
median, the mode, or random values. Features of the qualitative datatype were
filled using the mode. Quantitative data with a uniform distribution was
imputed using random values. Quantitative data with a skewed distribution was
imputed using the median.
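The per-datatype imputation strategy above can be sketched as follows. This is a minimal illustration on a hypothetical toy frame, not the project's actual columns; the column names and values here are assumptions chosen only to show one feature of each kind.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the churn dataset
df = pd.DataFrame({
    "Techie": ["Yes", "No", None, "No"],          # qualitative -> mode
    "Income": [30000.0, None, 45000.0, 52000.0],  # skewed quantitative -> median
    "Children": [1.0, None, 3.0, 0.0],            # uniform quantitative -> random draw
})

# Qualitative feature: fill with the mode
df["Techie"] = df["Techie"].fillna(df["Techie"].mode()[0])

# Skewed quantitative feature: fill with the median
df["Income"] = df["Income"].fillna(df["Income"].median())

# Uniform quantitative feature: fill with random draws from the observed values
rng = np.random.default_rng(0)
mask = df["Children"].isnull()
df.loc[mask, "Children"] = rng.choice(df["Children"].dropna(), size=mask.sum())

print(df.isnull().sum().sum())  # 0 -> all missing values filled
```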
Verification of this data
cleaning step was performed in two stages. First, the pandas.DataFrame.isnull().sum()
method was applied to all columns that originally contained missing values.
For each feature it was applied to, the method returned zero as the
sum of null values, indicating that all missing values had been filled.
Next, visualizations were
produced to understand the changes in the dataset. Matrices of missing values
were generated using the msno.matrix()
function both before and after applying the data cleaning transformations
(refer to Section C.4:Code and Section D.3:Code, respectively). While sparsity
was observed in the original matrix, it was not observed in the subsequent
matrix.
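The two-stage verification can be sketched as below. The before/after frames are hypothetical stand-ins for illustration; the second stage, which in the report uses msno.matrix(), is shown only as a comment since it produces a plot rather than a testable value.

```python
import pandas as pd

# Hypothetical before/after frames for illustration
before = pd.DataFrame({"Income": [1.0, None, 3.0]})
after = before.fillna(before["Income"].median())

# Stage 1: per-column null counts should all be zero after cleaning
print(before.isnull().sum().to_dict())  # {'Income': 1}
print(after.isnull().sum().to_dict())   # {'Income': 0}

# Stage 2 (in the report) compared missing-value matrices:
#   import missingno as msno
#   msno.matrix(df)   # sparsity visible before cleaning, absent after
```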
Outliers were discarded from the
dataset if it was determined that they were the result of error or illegitimate
entry. In this case, the outliers found in the MonthlyCharge
and Email features were removed from the dataset.
Visualizations were used to verify
this data cleaning step. Boxplots were used to identify the outliers contained
in the MonthlyCharge and Email features using the pandas.DataFrame.boxplot()
function. After applying the data cleaning transformations, subsequent boxplots
showed fewer outliers in both features.
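The identify-then-verify workflow above can be sketched as follows. The values are hypothetical stand-ins for MonthlyCharge, and the 1.5×IQR fence used here is an assumption (it is the fence a boxplot draws); the report itself only states that erroneous outliers were dropped.

```python
import pandas as pd

# Hypothetical values standing in for MonthlyCharge; 999.0 plays the outlier
s = pd.Series([150.0, 160.0, 155.0, 170.0, 999.0], name="MonthlyCharge")

# Flag values beyond the 1.5*IQR fences that a boxplot marks as outliers
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
cleaned = s.drop(outliers.index)

print(list(outliers))  # [999.0]

# Verification mirrors the report: boxplots before and after the drop
# s.to_frame().boxplot();  cleaned.to_frame().boxplot()
```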
To view the results of the data
cleaning process, refer to the attached churn_cleaned_data.csv.
Given that the pandas.DataFrame.duplicated()
function was only applied across records and Customer_id
values, it is possible that some instances of duplication were not detected.
For example, if there was an instance of duplication that occurred across the
combination of City, State, and Interaction features, the methods used in
Section C.4:Code would not have identified this data
quality issue.
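The limitation described above can be demonstrated concretely. In this hypothetical frame (the values are illustrative, not from the dataset), the checks actually performed find nothing, while a subset check over City, State, and Interaction would flag the duplication.

```python
import pandas as pd

# Hypothetical frame: rows 0 and 2 agree only on City/State/Interaction
df = pd.DataFrame({
    "Customer_id": ["K409198", "S120509", "T919384"],
    "City": ["Point Baker", "West Branch", "Point Baker"],
    "State": ["AK", "MI", "AK"],
    "Interaction": ["aa90", "fb76", "aa90"],
})

# The checks used in the report: full-record and Customer_id duplicates
print(df.duplicated().any(), df.duplicated(subset=["Customer_id"]).any())  # False False

# A subset check over another column combination would catch row 2
print(df.duplicated(subset=["City", "State", "Interaction"]).any())  # True
```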
Univariate imputation (filling
NA/NaN values using the mean, median, or mode) was
the dominant method for handling missing values. The drawback of univariate
imputation is that it can distort the distribution of the data.
Outliers that were deemed
erroneous were dropped from the dataset, reducing the sample size and potentially
eliminating valuable data points. For the majority of features, however,
all outliers were retained; the disadvantage of retaining outliers is that they
can distort statistical models.
As noted in Section A, the
research question for this project is "Do customers enrolled in more
service offerings display lower levels of churn?" To answer this research
question, the data analyst will need to analyze the
relevant data, which includes:
None of these features were
determined to contain missing values, and because these features are
categorical, they cannot contain outliers. For this reason, the limitations
discussed in Section D.6 regarding missing values and outliers would not affect
the analysis of the research question. On the other hand, the limitation
regarding the identification of duplicates could impact the analysis; if
unidentified duplicates exist in the dataset, then conclusions drawn from the
data might be distorted or inaccurate.
A Principal Component Analysis (PCA)
was performed on all numeric data (refer to Section D.3: Code). The results of
the PCA indicate that six principal components should be retained.
The Kaiser Rule recommends retaining
principal components with eigenvalues greater than 1, which is the criterion
used to identify the six principal components above.
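Applying the Kaiser Rule can be sketched end to end with NumPy, mirroring the covariance-matrix approach used in the code above. The data here is randomly generated as a hypothetical stand-in for the standardized numeric columns, so the retained count is illustrative only.

```python
import numpy as np

# Hypothetical standardized data standing in for the numeric churn columns
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# As in the code above: covariance matrix of the scaled data, then its eigenvalues
cov_matrix = np.dot(X.T, X) / X.shape[0]
eigenvalues = np.linalg.eigvalsh(cov_matrix)[::-1]  # descending order

# Kaiser Rule: retain only components with eigenvalue > 1
n_keep = int((eigenvalues > 1).sum())
```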
PCA is a data mining technique used
to reduce the dimensionality of a dataset. This is beneficial for several
reasons. First, dimensionality reduction helps prevent a modeling error in
statistics called overfitting. Second, a reduced dataset means that machine
learning algorithms take less time to train. Finally, PCA improves data
visualization by reducing the number of features that
need to be visualized.
To view a walkthrough
demonstration of the functionality of the code, refer to the Panopto link: ...
Bilogur, et al. (2018, February).
doi:10.5281/zenodo.1184723
Caswell, et al. (2023, March).
doi:10.5281/zenodo.7697899
Gommers, et al. (2023, February).
doi:10.5281/zenodo.7655153
Grisel, et al. (2023, March).
doi:10.5281/zenodo.7711792
Pawson, I. (2021, August).
doi:10.25080/majora-1b6fd038-026
The pandas development team. (2023, March).
doi:10.5281/zenodo.7741580
Larose, D. T., & Larose, C. D. (2019). Data
Science Using Python and R. Hoboken: Wiley. Retrieved March 2023
pandas. (2023). API reference. Retrieved March
2023, from pandas: https://pandas.pydata.org/docs/reference/index.html
The SciPy Community. (2023). SciPy API.
Retrieved March 2023, from SciPy:
https://docs.scipy.org/doc/scipy/reference/index.html